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I. INTRODUCTION 


Simultaneous localization and mapping (SLAM) [1, 2] is considered as an essential technology 
for autonomous mobile robots to effectively and safely operate in unknown environments. SLAM 
consists in the concurrent construction of a model of the environment (the map), and the estimation 
of the state of the robot moving within it [3]. Visual odometry (VO) ca be considered a reduced 
SLAM system, in which the loop closure (or place recognition) module disabled [3]. Monocular 
vision-based VO is incapable of recovering the metric scale and suffers from illumination change, 
less texture, and etc. Therefore, inertial measurement unit (IMU) has been extensively introduced 
to integrate with vision system to construct the visual-inertial SLAM (VI-SLAM ) or visual-inertial 
odometry (VIO). In the last decades, a large amount of research have significantly promoted the 
development and applications of the visual or visual-inertial system. A recent survey of the current 
state of SLAM can be found in [3]. A summary of the most representative visual and visual-inertial 
systems is listed in [4] 

The VINS-Mono [5, 6] is a monocular visual-inertial 6 DOF state estimator proposed by Aerial 
Robotics Group of HKUST in 2017. It can be performed on MAVs, smartphones and many other 
intelligent platforms. Because of the significant robustness, accuracy and scalability, it has gained 
extensive attention worldwide. The whole framework of the VINS-Mono is shown as Fig. 1. The 
main contribution includes: 

1. A robust initialization procedure which can produce a comparatively accurate estimation 
of the visual scale, gravity, velocity and gyroscope bias. 
Sliding window-based local optimization. 


3. Online relocalization and 4 DOF global pose graph optimization. 
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Fig. 1 Framework overview of the VINS-Mono (Shaojie Shen, 2017). 


This manuscript is organized as following. The process of the IMU preintegration is described 
in Section II. Section III presents the visual/inertial co-initialization procedure. The tightly-coupled 
nonlinear optimization is then derived in Section IV. Section V deduces the marginalization. The 
derivation of the global optimization with GPS in VINS-Fusion [7], which is an extension of VIN- 


Mono, are provided in Section VI. Refer to the Appendix for details of some equation derivation. 


NOMENCLATURE 

a) Matrices are denoted as upper case bold letters. 

b) Vectors are denoted as lower case bold italic letters. 

C) Scalar 1s denoted as lower case italic letters. 

d) The coordinate frames involved in the vector transformation are denoted as superscript and 
subscript. For vectors, the superscript denotes the projected coordinate system. 

e) *, estimated or computed values. ——— 

f) *, observed or measured values. ] biben 

g)a,,element of vector a on x axis. — 


Il. IMU PREINTEGRATION 


2.1 IMU Preintegration in Continuous Time 


The IMU preintegration is proposed in [8, 9]. In most practical applications, the IMU data rate 
is larger than that of the camera, which means that there are usually several IMU measurements 
between two consecutive frames, as shown in Fig. 2. First of all, we have to know that an IMU 
consists of a gyroscope and an accelerometer to measure the angular rate and acceleration of the 


IMU w.r.t. the inertial frame, respectively. For accelerometer, its measurement can be written as 


f? =a -g (1) 
where f is the special force, œ is the true acceleration of the IMU, g is the local gravity. It 


means that the output of the accelerometer is not the true acceleration of the IMU, but the 


acceleration minus gravity. For example, if the IMU frame is defined as right-front-up, its 


measurement will be x == g’ z 9.8m/s when the IMU keeps static and level. 
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Fig. 2 Illustration of the IMU preintegration (Shaojie Shen, 2017). 


Given two time instants that correspond to image frames b, and b,,, ,the state variables are 


constrained by inertial measurements during time interval [k, k + 1| , 
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NOTICE: Here, g” does not denote the gravity in world coordinates but the projection of fè in 


the world frame. 
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For the last equation in (2), according to [10], we have 
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In this manuscript, we define that g = a, q, | = |, dx Wy q; | . 


Changing the reference frame w ofIMU propagation to local frame b, , we can only integrate 


the parts which are related to linear acceleration & and angular velocity @ as follows: 
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2.2 IMU Preintegration in Discrete Time 


As for the discrete-time implementation of the IMU preintegration, we can apply Mid-point 


method as following 
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NOTICE: Strictly speaking, the preintegration values of position a and velocity By do not 


have any physical meaning, since the gravity is ignored in the integral process. However, we can 
imagine a zero gravity space in which the measurement of accelerometer will be the real 


acceleration, then Eq. (6) and (7) become easy to understand. 
2.3 Error-state Kinematics in Continuous Time 


Inspired by [10], we can write the error-state equations of the kinematics of an inertial system. 
We introduce the error perturbation analysis as following 
X=x+0x (8) 


where x isthe ideal (or nominal [10]) value of the sensor measurements or the body state without 


any error; X is the calculated (or observed, or true [10]) value of measurements and body state 
which contain errors OX. 
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Assume that the acceleration bias and gyroscope bias are random walk, whose derivatives are 
, : : , 2 2 ‘ . : 

Gaussian white noise, i.e. n, ~ N(0,05 ) n, ~ N(0,0; ). The noise in acceleration a,, and 

gyroscope measurements @,, are treated as the same. 


Then, we have the error-state kinematics, 
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Derivations of the differential equation of the velocity and attitude errors are developed as follows, 


the second-order small terms are so trivial that we ignore it, 
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Then, we have 
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As we all know, 
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Within a small time interval, we have 
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where J; isthesub-block matrix in J M whose location is corresponding to & , the others 


are of the similar meaning. 
2.4 Error-state Kinematics in Discrete Time 


We assume that the sensor biases between two IMU measurements are constant. According to 


Eq. (17) and using the Mid-point integration for discrete time implementation, we have 
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We can rewrite it in matrix form as follows (This is mostly according with the code. See 


"midPointIntegration" in “integration base.h" in [6].), 
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Ill. INITIALIZATION 


The initialization of the VINS-Mono can be divided into four main steps, as shown in Fig. 3. 
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Fig. 3. Initialization procedure 
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It starts with a vision-only structure from motion (SFM) in the sliding window. Firstly, we 
recover the relative pose (up-to-scale) between two image frames by Five-point method when we 
can find a previous frame which has more than thirty tracked features, and the average parallax is 


more than twenty pixels between it and the latest frame. Then, we set this frame (not always the 


first frame in the code) as the reference frame ( c ) for the moment. Subsequently, we triangulate 


all the features between the two frames. We perform EPnP [11] to estimate poses of other frames 
in the whole window based on the 3D features. At last, a bundle adjustment is performed to 
minimize the total reprojection error to optimize all poses in the window. 


We can transform the poses from camera center to IMU center according to the extrinsic 


parameters p. q.) 
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Changing the reference frame from the world frame w to the c, frame in the SFM, we can 


rewrite the IMU preintegration in (5) as follow: 
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From (28) and (29) we can get a system of linear equations, which can be solved easily by 


matrix decomposition. 
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The state we estimate in this step is 
b, b b, c 
x=| vp d rtaV g A (32) 


Moreover, the estimated gravity vector can be refined by the known gravity magnitude. Generally, 
a vector in a three dimensional space is constraint by its magnitude and projection on three axes. If 
we know any three values of them, the other one can be uniquely determined. Since the magnitude 
of gravity is known, the degrees of freedom of gravity is two. We can parameterize the gravity with 


two variables on its tangent space as follow 
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in Eq. (31) by Eq. (33), and solve w, and w, recursively until converge (when w, and 


After refining gravity, we can recover the transformation from the cy frame to the world frame 
by rotating the gravity vector. We set the origin of the world frame at that of the c, frame. 
However, the heading angle can be any value because the number of unknown terms in the rotation 
matrix 1s nine (which is more than the knowns), and we cannot determine the absolute heading 
using a camera and a MEMS IMU. So we can rotate the world frame to ensure the heading angle 
of the first frame is zero. 

Now, the initialization process 1s completed. We have recovered the scale of visual odometry by 
aligning the IMU measurements with visual-based SFM. We have established the world frame 
whose z axis 1s parallel with the gravity vector. Additionally, we have estimated the real-scaled 


velocity in the world frame. 
IV. TIGHTLY-COUPLED NONLINEAR OPTIMIZATION 


4.1 Cost Function 


In the back-end optimization, the inverse depth of features (Why inverse depth? Because it has 
better numerical stability which is convenient to solve [12]), the pose and velocity of every frame, 


the IMU bias (gyroscope and accelerator) and the extrinsic parameters are optimized together. 


The state estimation can be thought as a Maximum A Posteriori (MAP) problem [13]. The 
estimation of the state is equal to calculating the conditional probability distribution of state 


quantities under conditions of known observations: 
P(x|z) (34) 

According to the Bayesian rule, we have 
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where P(x|z ) is the posterior probability; P(x) is the priori probability; P(z|x) is the 


P(x|z)= oc P(z|x) P(x) (35) 


likelihood probability. It is difficult to calculate the posterior probability distribution directly. 
However, it is feasible to find an optimal state estimation which maximizes the posterior probability. 


In most cases, we do not have any priori information about the state of the system. So, we have 
X =arg max P(x z) oc arg max P(z x) P(x) oc arg max P(z x) (36) 


The problem has evolved to finding a state of the system that can produce the observations most 
probably. 
We assume the uncertainty of measurement is Gaussian distributed, namely, z ~ N(z,Q). 


Then, the negative log-likelihood of (36) can be written as 


X =arg max P(z|x)=argmin 5 |z - &(x)||, (37) 
h(e) is a function of the state. lel Q denotes the Mahalanobis norm. 


NOTICE: Considering an arbitrary dimension Gaussian distribution, x ~ N ( u,Q) , the 
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In the VINS-Mono, the cost function can be written as 
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where (r, J p) is the priori information from marginalization; rp is the residuals of the IMU 
measurements; B is the set of all IMU measurements in the sliding window; rc is the residuals 
of the visual model; and C is the set of features which have been observed at least two times in 
the sliding window. 

The state estimation is converted to a nonlinear least square problem, and it can be solved by 
Gaussian-Newton or Levenberg-Marquardt approach. 


We take the Gaussian-Newton approach for example to solve a general optimization problem, 


Y= argmin | f (x) (41) 


denotes the two-norm. We take into 


where P is the covariance matrix of the residuals; 














account the first-order Taylor expansion of the cost function f (x) , then the problem become 
calculating the increment Ox. 
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Calculating the derivative of the function in (42) and let it equal to zero, we have 
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(43) 
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This can be solved by matrix decomposition, e.g. SVD. 
4.2 IMU Model 
As per Eq. (5), we can write the residuals of IMU measurements as following, 
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For IMU residual model, the optimization variables are 
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It is consistent with the code implementation. The dimensions of the state vectors above are 7, 9, 7, 
and 9, respectively. 
NOTICE: In the state vector, we use the quaternion to represent the attitude of the vehicle. 
However, it is not convenient to calculate the Jacobian w.r.t the attitude in quaternion. In addition, 
we know that the degrees of attitude 1s three while there are four variables in the quaternion which 
can overparameterize the problem. So we use Lie-algebra so(3) (namely, rotation vector) to 
perform perturbation analysis and iterative optimization when considering the attitude error. We 
calculate the increment of attitude in so(3), then we update the states by the quaternion of the 
attitude increment. This is implemented by instantiating ““LocalParameterization” in ceres solver 
[14]. Therefore, the dimensions of the Jacobian matrix in the IMU residual model should be <15x7, 
15x9, 15x7, 15x9>. Therefore the last row of J [0] and J [2] is zero. For a simple expression, 
we only write the left six rows of them (see Eq. (46) and (48)). The vison model is written as the 
same way. 

The explanation of the attitude perturbation analysis is shown in Appendix A. We perform 
perturbation analysis for Eq. (44), and the Jacobian matrix can be written as follows. The details 


can be found in Appendix B. 


W Ww Ww I Ww 
-R^ Ri (Pp. — Po, =v, At, + 58 a) |» 


yoy, alter) | ° aee e, | u 
7 e| py, 4; 0 [RA (py - pi +g" At) |x 
0 0 
0 0 
-R;A& -J; Ji, 
ap, ( 2^ 0 0 -r| (7) J ear" es; | Ji, 
J[1] s F (2.x) _ 7 Ee 
59? Gg ba Bee] | -Rì -Jf -Jf 
0 -I 0 
0 0 -I 





R^ 0 


Org (zz) | ^ (f 2) eae Oa, 


J Bie H 5 D 


Du ak+1 a 


Op Var NOS 48 
Phe Tot as] | 0 i 
0 0 
0 0 
0 0 0 
š 0 0 0 
Org (eit sx) b 
=|R% 0 0 (49) 
I 0 
0 I 


Cc 


4.3 Vision Model 


Considering the I" feature which is firstly observed in ds image, the residual for the 


observation inthe jJ á image can be written as 
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let 


where n. is the back projection model which outputs the correspondence normalized vector in 


3D space(see “projection factor.cpp" in the code); i P J is the first observation of the 1” 
feature in the i” image; D E is the observation of the / feature in the g image. The 


authors project the residual vector onto the tangent plane. |b, b, | are two arbitrarily selected 


orthogonal bases which span the tangent plane of P^ l 
NOTICE: The second equation of (50) can be derived by the following equations. 
R'P^ +P” - p" 
; j (52) 
P w~ -R,F » 


For Vision model, the optimization variables are 
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The Jacobian can be calculated by the chain rule, 
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For the tangent space form, we have 
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, we have the following equations. Details can be found in Appendix C. 
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V. MARGINALIZATION 
Since the number of states increase along with the time, the computational complexity will 
increase quadratically accordingly. In order to reduce compute burden without loss of information, 
the marginalization procedure is performed to convert the previous measurements into a prior term. 
The set of states to be marginalized is denoted as y,, and the set of remaining states is denoted as 


X, . According to (43), we rearrange the states’ order and get the following equation 
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Then we perform the Schur complement to carry out the marginalization [15, 16] as follows 
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Note that we keep states from instant 71 toinstant n inthe sliding window. The states before 


m are marginalized out and converted to a prior term. Therefore, the MAP problem can be written 


as that in [16], 
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Where S is the set of measurements. 


VI. GLOBAL OPTIMIZATION IN THE VINS-FUSION 


A general optimization-based framework for global pose estimation is proposed in [17], which 
is an extension of [5]. Local estimation from VIO/VO is fused with global sensor data in a pose 
graph optimization manner. Within the optimization, the transformation from the local frame to the 
global frame is estimated, so the local state can be aligned into the global coordinate system. The 
illustration of the global pose graph structure 1s shown in Fig. 4. The state vector to be estimated in 


global optimization is shown in Eq. (65). 





(x) State [| }+—— Magnctometer Factor 
——mn — Local Factor (VO/VIO) B — Barometer Factor 


B — GPS Factor Ng ----- Other Factor(s) 
Fig. 4 Illustration of the global pose graph structure. (Tong Qin, 2018) 
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Every node contains the pose of the vehicle in the global frame (Here is the GPS frame), while 
the edge between two consecutive nodes is a local constraint from VIO/VO estimation. The local 


pose and global pose of one node can be written as a combination of rotation and translation, 
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where i isthe i" node; | is the local reference frame; G is the global reference frame; R 


and T are the rotation and translation from body frame to reference frame, respectively. Given two 
consecutive nodes i and J, the relative transformation can be derived from local pose and global 


pose, respectively. In the code, the authors set the first GPS point as the origin and subsequently set 
the ENU (East-North-Up) coordinate system as global frame. 
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Then, the residuals can be written as 
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For GPS position constraint, we have 
G G 

óP, = |P estimation — P, _ measurement | (70) 


The standard deviation (std) of the global position error can be obtained from GPS positioning 


algorithm directly. In the code [7], the std of local position and rotation are fixed as, 0.1 m for 


position error std, and 0.01 rad for rotation error std, respectively. The two factors are jointly 
optimized by ceres solver. Since the pose graph is quite sparse, the authors keep a huge window to 
get accurate and globally drift-free pose estimation ( not implemented in the code). Although many 
nodes are optimized, the transformation between the local frame and the global frame is updated 
only by the global pose of the last node. As far as I am concerned, all the nodes should be taken 
into account to reduce the effect of possible coarse errors. 

However, the time synchronization between the VIO and GPS measurements is not decently 
handled in the code, which may introduce significant pose error. Moreover, the integration of GPS 
and VIO is in a loosely-couple fashion; thus, the GPS observations are not used to correct the sensor 
errors of the IMU. 


VII. APPENDIX 


A. Attitude Perturbation 


In most optimization problems, we calculate the Jacobian of residuals w.r.t. the variables at first. 
Then we use the output increment to update the state of the vehicle. So the way that we compute 
the Jacobian is relative to the way we update the state of body. 

There are many attitude representations to parameterize the attitude of vehicles. There definitions 
and the conversions from one form to another can be found in [10, 18]. It is worth mentioning that 
the rotation vector and rotation matrix correspond to the Lie-algebra so(3) and Lie-group SO(3), 
respectively. 

For instance, the authors in [5] use the quaternion to represent the attitude of the vehicle, and the 
rotation matrix to represent the transformation between two coordinates. The authors define the 


update way of the vehicle attitude as (see “pose local parameterization.cpp" in the code) 


w we b 
Vy = % Fy (71) 
where, q, is the vehicle attitude derived from the system dynamic eqation; Óq; is the 


increment calculated by the nonlinear optimization. qj is the attitude after optimization. So the 


perturbation should be the attitude error in b-frame (from current b to b’ ) when we perform 


perturbation analysis on the residual function. For example, 
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Where, Ry (80) and R? (80) are the attitudes with perturbation; 50, is the rotation vector 


corresponding to the attitude perturbation. 


B. The Jacobian of The IMU Model 


Given Eq. (71) and (72), we have 
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Perform the Lie-algebra left multiplication perturbation model [13], we have 
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One thing that we must pay attention to 1s that the cross production does not confirm the associative 
law, namely 
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From [10], we know 
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For unit quaternion, we have 
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where, L [e] and R [e] are respectively the left- and right- quaternion-product matrices, [e] 4,518 the 


3x3 block atthe right-bottom of the matrix. 
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As the same as Eq. (74), we have 
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Given (20), we 
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As the same as Eq. (83) and (84), we have 
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C. The Jacobian of The Vision Model 
Given Eq. (71) and (72), we have 
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